ABSTRACT

A compiler facilitates efficient unrolling of loops and
enables the elimination of extra branches from the loops,
including the elimination of conditional branches from
unrolled loops with early exits. Unrolling also enhances
other optimizations, such as prefetch, scalar replacement,
and instruction scheduling. The unroll factor is calculated to
determine the amount of loop expansion and the optimum
location to place compensation code to complete the original
loop count, i.e. before or after the unrolled loop. The
compiler is applicable, for example, to modern RISC
architectures, where the latency of memory references and
branches is higher than that of integer and floating point
arithmetic instructions.

I = N
BEGIN_LOOP:
    A[I] = B[I] + X                            (150)
    I = I - 1
    IF (I > 0) GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 3
(PRIOR ART)



I = N
BEGIN_LOOP:
    A[I] = B[I] + K                            (401)
    I = I - 1
    IF (I < 0) GOTO END_LOOP                   (403)
    C[I] = A[I] + X                            (405)
    GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 7a

BEGIN_LOOP:
    A[I] = A[I-2] / B[I]                       (367)
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 9
(PRIOR ART)

I = N
IF (I < 4) GOTO BEGIN_COMPENSATION_LOOP        (301)
UNROLLED_LOOP:
    A[I] = B[I] + X                            (333)
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 4) GOTO UNROLLED_LOOP             (159)
END_UNROLLED_LOOP:
    IF (I = 0) GOTO END_COMPENSATION_LOOP      (303)
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

Fig. 4

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 5) GOTO UNROLLED_LOOP             (321)
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:                       (323)
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

Fig. 5

I = N
J = ((N-1) % 4) + 1      NOTE: ((N-1) MOD 4) + 1
I = N - J                NOTE THAT I IS GUARANTEED TO BE DIVISIBLE BY 4 HERE.
BEGIN_COMPENSATION_LOOP:                       (383)
    A[J] = B[J] + X
    J = J - 1
    IF (J > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    IF (I = 0) GOTO END_UNROLLED_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
    PRINT I

Fig. 6

I = N
BEGIN_LOOP:
UNROLLED_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP      (365)
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

Fig. 7b

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + K                            (319)
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    IF (I >= 5) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP      (365)
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

Fig. 8
Other reference numerals: 312, 361, 363

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 10
(PRIOR ART)
Reference numerals: 391, 393, 395

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 11
Reference numeral: 388

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T2 = A[I]
    A[I+1] = T1 / B[I+1]
    T = A[I+1]
    A[I+2] = T2 / B[I+2]
    T1 = A[I+2]
    I = I + 3
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 12
Reference numerals: 400, 401, 402, 403, 410

DETERMINE TRIPCOUNT IF KNOWN
DETERMINE UNROLL FACTOR FROM LOOP LENGTH, PREFETCH SPAN,
PROFILE, RESOURCE USAGE

IF TRIPCOUNT KNOWN AND TRIPCOUNT MOD UNROLLFACTOR = 0    (YES / NO)

ULCODE = UNROLL(LOOP, UNROLLFACTOR, NOCOMP)
REMOVEMIDDLEEXITS(ULCODE) EXCEPT THE FINAL EXIT

ULCODE = UNROLL(LOOP, UNROLLFACTOR, COMPOK)
REMOVEMIDDLEEXITS(ULCODE)

OUTPUT(ULCODE)

ULCODE = MOVE FINAL EXIT TO END OF LOOP (ULCODE)

IF PLACECOMP(UNROLLFACTOR) = BEFORE    (YES / NO)

OUTPUT ULCODE
OUTPUT COMPCODE FOR LOOP

OUTPUT COMPCODE FOR LOOP
OUTPUT ULCODE

DONE    (999)

Fig. 13
Other reference numerals: 203, 205, 222, 226, 229, 242, 244, 246

INTELLIGENT LOOP UNROLLING

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to compilers. More particularly, the
invention relates to techniques for unrolling computation
loops in a compiler so as to generate code which executes
faster.

2. Description of The Prior Art 

FIG. 1 is a block schematic diagram of a uniprocessor
computer architecture, including a processor cache. In the
figure, a processor 11 includes a cache 12 which is in
communication with a system bus 15. A system memory 13
and one or more I/O devices 14 are also in communication
with the system bus.

FIG. 2a is a block schematic diagram of a software
compiler 20, for example as may be used in connection with
the computer architecture shown in FIG. 1. The compiler
Front End component 21 reads a source code file (100) and
translates it into a high level intermediate representation
(110). A high level optimizer 22 optimizes the high level
intermediate representation 110 into a more efficient form. A
code generator 23 translates the optimized high level
intermediate representation to a low level intermediate
representation (120). The low-level optimizer 24 converts the
low level intermediate representation (120) into a more
efficient (machine-executable) form. Finally, an object file
generator 25 writes out the optimized low-level intermediate
representation into an object file (141). The object file (141)
is processed along with other object files (140) by a linker 26
to produce an executable file (150), which can be run on the
computer 10. In the invention described herein, it is assumed
that the executable file (150) can be instrumented by the
compiler (20) and linker (26) so that when it is run on the
computer 10, an execution profile (160) may be generated,
which can then be used by the low level optimizer 24 to
better optimize the low-level intermediate representation
(120). The compiler 20 is discussed in greater detail below.

The compiler is the piece of software that translates
source code, such as C, BASIC, or FORTRAN, into a binary
image that actually runs on a machine. Typically the
compiler consists of multiple distinct phases, as discussed
above in connection with FIG. 2a. One phase is referred to as
the front end, and is responsible for checking the syntactic
correctness of the source code. If the compiler is a C
compiler, it is necessary to make sure that the code is legal
C code. There is also a code generation phase, and the
interface between the front-end and the code generator is a
high level intermediate representation. The high level
intermediate representation is a more refined series of
instructions that need to be carried out. For instance, a loop
might be coded at the source level as:

which might in fact be broken down into a series of steps,
e.g. each time through the loop, first load up I and check it
against 10 to decide whether to execute the next iteration.

A code generator takes this high level intermediate
representation and transforms it into a low level intermediate
representation. This is closer to the actual instructions that
the computer understands. An optimizer component of a
compiler must preserve the program semantics (i.e. the
meaning of the instructions that are translated from source
code to a high level intermediate representation, and thence
to a low level intermediate representation and ultimately an
executable file), but rewrites or transforms the code in a way
that allows the computer to execute an equivalent set of
instructions in less time.

Modern compilers are structured with a high level
optimizer (HLO) that typically operates on a high level
intermediate representation and substitutes in its place a more
efficient high level intermediate representation of a
particular program that is typically shorter. For example, an
HLO might eliminate redundant computations. With the low
level optimizer (LLO), the objectives are the same as the HLO,
except that the LLO operates on a representation of the
program that is much closer to what the machine actually
understands.

FIG. 2b is a block diagram showing a low level optimizer
for a compiler, including a loop unrolling component 30
according to the invention. The low level optimizer 24 may
include any combination of known optimization techniques,
such as those that provide for local optimization 35, global
optimization 36, loop identification 37, loop invariant code
motion 38, prefetch 34, register reassociation 31, and
instruction scheduling 32.

Source programs translated into machine code by
compilers consist of loops, e.g. DO loops, FOR loops, and
WHILE loops. Optimizing the compilation of such loops
can have a major effect on the run time performance of the
program generated by the compiler. In some cases, a
significant amount of time is spent doing such bookkeeping
functions as loop iteration and branching, as opposed to the
computations that are performed within the loop itself.
These loops often implement scientific applications that
manipulate large arrays and data instructions, and run on
high speed processors.

This is particularly true on modern processors, such as
RISC architecture machines. The design of these processors
is such that in general the arithmetic operations operate a lot
faster than memory fetch operations. This mismatch
between processor and memory speed is a very significant
factor in limiting the performance of microprocessors. Also,
branch instructions, both conditional and unconditional,
have an increasing effect on the performance of programs.
This is because most modern architectures are
super-pipelined and have some sort of a branch prediction
algorithm implemented. The aggressive pipelining makes the
branch misprediction penalty very high. Arithmetic
instructions are inter-register instructions that can execute
quickly, while the branch instructions, because of
mispredictions, and memory instructions such as loads and
stores, because of slower memory speeds, can take a longer
time to execute.

Modern compilers perform code optimization. Code
optimization consists of several operations that improve the
speed and size of the compiled code, while maintaining
semantic equivalence. Common optimizations include:

  prefetching data so that they are available in cache
    memory when needed;
  detecting calculations as computing constants and
    performing the calculation at compile time;
  scalar replacement which keeps the value of a variable in
    a register within the loop;
  moving calculations outside of loops where possible; and
  performing code scheduling, which consists of rearranging
    the order of and modifying instructions to achieve
    faster running but semantically equivalent code.

Many modern compilers also employ an optimizing
technique known as loop unrolling to generate faster running
code. In its essence, loop unrolling takes the inner loop, i.e.
the code between the beginning and the end of the loop, and
repeats it in the inner loop some number of times, e.g. four
times. It then executes the unrolled loop one-fourth as many
times as it would have executed the original loop. The
number of times the loop is replicated within the unrolled
loop is called the unroll factor. Because the number of times
the original loop is executed is not always divisible by the
unroll factor, compensation loop code often has to be
generated to execute the remaining instructions of the
original loop that are not executed by the unrolled loop.

As discussed above, such loops as DO, FOR and WHILE
loops are common in programs, especially in scientific and
other time-consuming programs. Frequently 80% of the
running time of a program can be in a few small loops. As
a result, anything that can speed up such loops is of great
value in making a more efficient compiler.

Consider the simple loop shown on FIG. 3. The three
instructions in the inner loop 150 are executed N times.
According to the prior art, this loop can be unrolled with an
unroll factor of four to produce the code shown on FIG. 4,
where the inner loop (less the exit condition) is replicated 4
times 333. This loop is also followed by a test 303 to see if
the full original loop has been completed, and a compensation
loop 155 which is executed to complete the original
loop trip count if it has not been completed.
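
The relationship between the original loop and the prior art unrolled form can be sketched in Python. This is only an illustrative model of the loops of FIG. 3 and FIG. 4, not compiler output; the function names and the use of Python lists for the arrays A and B are assumptions made for the example, and N is assumed to be at least 1.

```python
def original_loop(b, x, n):
    """Model of the FIG. 3 loop: a[i] = b[i] + x for i = n down to 1 (n >= 1)."""
    a = [0] * (n + 1)
    i = n
    while True:                      # bottom-tested loop: body runs at least once
        a[i] = b[i] + x
        i -= 1
        if not i > 0:                # IF (I > 0) GOTO BEGIN_LOOP
            break
    return a


def unrolled_by_four(b, x, n):
    """Model of the FIG. 4 loop: unroll factor 4 with a trailing compensation loop."""
    a = [0] * (n + 1)
    i = n
    while i >= 4:                    # enough iterations left for one full unrolled pass
        a[i] = b[i] + x              # four textual copies of the original body
        i -= 1
        a[i] = b[i] + x
        i -= 1
        a[i] = b[i] + x
        i -= 1
        a[i] = b[i] + x
        i -= 1
    while i > 0:                     # zero test plus compensation loop for the remainder
        a[i] = b[i] + x
        i -= 1
    return a
```

For any N of at least 1 both functions produce the same array; the unrolled form simply executes its loop-closing test once per four elements.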

An inspection of the loop shows that it is semantically
equivalent to the loop of FIG. 3 because the same function
is performed and the same result is achieved. However, the
loop of FIG. 4 runs much faster for several reasons. First, the
conditional branch 159 which exits the loop is executed only
once for each time through the unrolled loop rather than
once for each time through the original loop. Assuming an
unroll factor of four as is shown here, this saves 3/4 of a
conditional branch per original loop iteration. More
importantly, other compiler optimizations interact with loop
unrolling and are able to do a much better job of optimizing
the unrolled loop, such as that identified by numeric
designator 333, as compared to the original loop, identified by
numeric designator 150. In an unrolled loop, there are more
operations that could be scheduled in parallel, more
opportunity to do scalar replacement and other optimizations,
and more possibilities to do prefetching.

The tests identified by numeric designators 301 and 303
are also of interest. These are the conditional branches which
have a higher probability of being mispredicted. Anything
that can be done to eliminate one of them will be very useful.

Prior art loop unrolling techniques have certain
disadvantages. For example, J. J. Dongarra and A. R. Hinds,
Unrolling Loops in FORTRAN, describes how one can unroll
loops manually by duplicating code. This is an early solution
to optimizing code that is not even implemented by the
compiler.

S. Weiss and J. E. Smith, A Study of scalar compilation
techniques for pipeline supercomputers, discuss unrolling in
the compiler. Here the authors address the simple situations:

a) Cases where the loop count is known at compile time.
They do not address loop unrolling when the loop count
is only known at run time.

b) Cases where the loop exit appears only at the beginning
or end of the loop. They do not address the situation of
unrolling loops with early exits (loops whose exit may
occur in the middle of the loop).

The authors do not address the following issues: 

a) Determining whether to place the compensation code 
before or after the unrolled loop. 

b) Tuning of the iteration count to reduce branch
misprediction.

c) Factors that affect the unroll factor. 
L. J. Hendren and G. R. Gao, Designing Programming
Languages for the Analyzability of Pointer Data Structures,
addresses the issue of unrolling loops as part of compiler
optimization. They do not however discuss:

a) Unrolling loops with early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor;

d) Whether to place a compensation code at the beginning
or end of the loops; and

e) Compiling loops whose trip count is only known at run
time.

J. Davidson, S. Jinturkar, Aggressive Loop Unrolling in a
Retargetable, Optimizing Compiler, Dept. of Computer
Science, Thornton Hall, University of Virginia, disclose a
code transformation, referred to as aggressive loop
unrolling, in a retargetable optimizing compiler where the
loop bounds are not known at compile time. Various factors
were analyzed to determine how and when loop unrolling
should be applied, resulting in an algorithm for loop
unrolling in which execution-time counting loops (i.e. a
counting loop whose iteration count is not trivially known at
compile time) are unrolled and loops having complex
control-flow are unrolled. However, they do not discuss:

a) Unrolling loops having early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor; and

d) Whether to place a compensation code at the beginning
or end of the loops.

Another part of the loop unrolling prior art is shown on
FIG. 7, which illustrates a loop having an early exit (also
referred to as a WHILE loop), consisting of an exit test and
branch 403 in the middle of the loop between computations
401 and 405. According to the prior art, the code, including
the exit 403, is replicated four times in an unrolled loop. The
number of branches in the loop is however not reduced
with this optimization.

While some of the techniques discussed in the prior art are
applicable to compilers for all computers, only some of them
are particularly applicable for modern RISC computers,
where branch instructions form a much bigger bottleneck than
in earlier technologies. Also, compilers for RISC
architectures are a lot more aggressive and the interactions of
various optimizations play a key role in the quality of the
final code.

SUMMARY OF THE INVENTION 

The invention provides a new compiler that can unroll
more loops than previous algorithms. It also significantly
reduces the number of branch instructions by cleverly
handling the iteration count and by converting loops with
early exits to regular FOR loops. The invention also provides
for computing the unroll factor and the placement of the
compensation loop by taking a lot of other optimizations into
consideration.

The compiler:

  Eliminates time consuming conditional branch
    instructions from the compensation code loop by changing
    the conditional exit of the main unrolled loop so that it
    always exits with at least one iteration which has yet to
    be executed by the compensation code. This eliminates the
    need to test for zero remaining iterations.

  Determines whether it is better to place the compensation
    code at the beginning or the end of the unrolled loop
    according to which one would likely provide the better
    optimization. Generally, it prefers to put the
    compensation loop in front of the main loop if the unroll
    factor is a power of two and after the main loop if the
    unroll factor is not a power of two.

  Computes the unroll factor by taking into account the
    interactions of other optimizations like prefetch, scalar
    replacement and register allocation, and also taking
    into account hardware features like number of
    functional units. Unrolling loops over-aggressively or
    under-aggressively can inhibit other optimizations or
    make them less effective.

  Converts loops with early exit to loops with exit at the end
    to apply more efficient optimizations to the loop. It does
    this by ensuring that the compensation code is always
    executed at least once, enabling the compiler to
    eliminate the exit tests from the unrolled loop.
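
The general placement preference stated above (compensation code in front when the unroll factor is a power of two, after otherwise) can be sketched as a small Python helper. The names `is_power_of_two` and `place_compensation` are hypothetical and chosen for this illustration only; the full decision described in the text also weighs interactions with other optimizations, which this sketch deliberately ignores.

```python
def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def place_compensation(unroll_factor: int) -> str:
    """Return 'before' or 'after' following only the general preference in the text."""
    # With a power-of-two unroll factor the remainder needed for a pre-loop
    # compensation count is cheap to compute, so placing it in front is attractive.
    return "before" if is_power_of_two(unroll_factor) else "after"
```

For example, `place_compensation(4)` gives `'before'` and `place_compensation(3)` gives `'after'`.
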
FIG. 1 shows a block schematic diagram of a uniprocessor
computer architecture including a processor cache;

FIG. 2a shows a block schematic diagram of a modern
software compiler;

FIG. 2b shows a schematic diagram of the low level
optimizer;

FIG. 3 shows a single program loop;

FIG. 4 shows the loop of FIG. 3 that has been unrolled
four times according to the prior art;

FIG. 5 shows the loop of FIG. 3 unrolled according to the
invention and having an eliminated branch;

FIG. 6 shows an unrolled loop having pre-loop
compensation code;

FIG. 7a shows a simple WHILE loop;

FIG. 7b shows the WHILE loop of FIG. 7a that has been
unrolled according to the prior art;

FIG. 8 shows the WHILE loop of FIG. 7a that has been
unrolled according to the invention;

FIG. 9 shows a schematic representation of a simple loop;

FIG. 10 shows the loop of FIG. 9 with scalar replacement
according to the prior art;

FIG. 11 shows the loop of FIG. 10 unrolled;

FIG. 12 shows a schematic representation of the loop of
FIG. 11 after copy elimination;

FIG. 13 shows a schematic representation of the compiler
logic which determines the unroll factor, compensation code
placement and other optimizations; and

FIG. 14 is a block schematic diagram of a compiler for a
programmable machine in accordance with the invention.

The invention provides a new compiler that features smart
unrolling of loops.

The invention provides a prefetch driver 34 that operates
in concert with such known techniques. The following
discussion pertains to the various elements of the low level
optimizer shown on FIG. 2b:

Local optimizations include code improving
transformations that are applied on a basic block by basic
block basis. For purposes of the discussion herein, a basic
block corresponds to the longest contiguous sequence of
machine instructions without any incoming or outgoing control
transfers, excluding function calls. Examples of local
optimizations include local common sub-expression elimination
(CSE), local redundant load elimination, and peephole
optimization.

Global optimizations include code improving
transformations that are applied based on analysis that spans
across basic block boundaries. Examples include global common
sub-expression elimination, loop invariant code motion,
dead code elimination, register allocation and instruction
scheduling.

Loop invariant code motion is the identification of
instructions located within a loop that compute the same
result on every loop iteration and the re-positioning of such
instructions outside the loop body.

Register allocation and instruction scheduling is the
process of assigning hardware registers to symbolic
instruction operands and the re-ordering of instructions to
minimize run-time pipeline stalls where the processor must
wait on a memory fetch from main memory or wait for the
completion of certain complicated instructions that take
multiple cycles to execute (e.g. divide, square root
instructions).

One important phase of the compiler identifies loops and
access patterns to estimate how many cycles are devoted to
loop iterations. In the invention, the compiler translates the
higher level application code into an instruction stream that
the processor executes, and in the process of this translation
the compiler unrolls loops.

The longer unrolled loops allow the compiler to provide 
several advantages, such as: 

1) It eliminates the extra branch exits. This saves CPU
cycles by not having to execute the branch instructions
and also helps reduce branch misprediction. Why this is
important is evident if one considers most modern
RISC architectures. These architectures have a long
pipeline that is fed by an instruction fetch mechanism.
When the fetch mechanism encounters a branch, it tries
to predict if the branch is going to be taken or not. It
then fetches instructions based on this prediction. The
prediction is necessary to keep the pipeline from
stalling. If the architecture's prediction is correct (this
is determined when the branch instruction completes
execution, which is a few cycles after it has been
fetched), then everything works fine; else all the
instructions that have been fetched after the branch are
discarded and the new instructions fetched based on the
correct outcome of the branch. This penalty of
discarding fetched instructions and fetching new ones, when
the branch is mispredicted, is known as the branch
misprediction penalty and it is very significant for most
modern architectures. It is of the order of 5-10 cycles
per branch instruction that is mispredicted. By reducing
the branch instructions, the number of branches that get
mispredicted automatically reduces.

2) It can better insert prefetches and effect other
optimizations into the longer inner loop code. When the loop
is unrolled, there are more memory instructions in the
loop and also the memory stride (the distance between
the memory accesses of an instruction in two
consecutive iterations) is bigger. If a loop is unrolled
four times, the memory stride goes up by a factor of four.
This helps the prefetch to do a more effective job. When the
memory stride increases, as long as it is less than the cache
line size (which is architecture dependent), the prefetches
become more effective. When the memory stride
becomes greater than the cache line size the prefetches
can hurt. Hence the loops should be unrolled such that
the memory stride is less than the cache line size
whenever prefetch instructions are going to be
generated for the loop and whenever possible.

3) Scalar replacement/recurrence elimination inserts
copies at the loops to keep the value of a variable live in
the next iteration (see, for example, FIGS. 9, 10, and
11). These copies can be eliminated by unrolling the
loop a certain number of times.

4) Longer sequences of non-branching instructions can
achieve an overlap between instructions that have
nothing to do with memory and those that do. This is known
as instruction scheduling and explained below. While
the access time between the processor and the cache is
typically 1 to 5 cycles, the retrieval time from cache to
memory is often on the order of 10 to 100 cycles. When
the processor actually gets to the point where the data
item is needed from memory, if the data is not in cache,
it might take 100 processor cycles to fetch it from main
memory. Where the compiler can optimize the longer
inner loop code, it may only be necessary to wait for 20
cycles because 80 cycles worth of look up time is
hidden or overlapped with the execution of other
instructions.
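
Two of the quantities discussed in items 1) and 2) lend themselves to a small worked calculation. The Python sketch below is a simplified model under stated assumptions, not the compiler's actual cost function: it counts only loop-closing branches (one per pass of the unrolled loop plus one per leftover iteration handled by post-loop compensation code) and it assumes a single array reference of fixed element size per original iteration. The function names and parameters are hypothetical.

```python
def loop_branches_executed(trip_count: int, unroll_factor: int) -> int:
    """Loop-closing branches executed for a given trip count (simplified model)."""
    full_passes, leftover = divmod(trip_count, unroll_factor)
    # One branch per unrolled pass, one per leftover iteration in compensation code.
    return full_passes + leftover


def max_unroll_for_prefetch(element_bytes: int, cache_line_bytes: int) -> int:
    """Largest unroll factor whose stride (factor * element size) stays below a cache line."""
    # Requires unroll_factor * element_bytes < cache_line_bytes; never less than 1.
    return max(1, (cache_line_bytes - 1) // element_bytes)
```

Under this model, `loop_branches_executed(100, 1)` is 100 while `loop_branches_executed(100, 4)` is 25, and `max_unroll_for_prefetch(8, 64)` is 7, since a factor of 8 would make the stride equal to the 64-byte line rather than less than it.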

Loop unrolling is integrated with other low level
optimization phases, such as the prefetch insertion algorithm,
register reassociation, and instruction scheduling. The new
compiler yields significant performance improvements for
some industry-standard performance benchmarks, for
example on the SPEC92 and SPEC95 benchmarks on the
Hewlett-Packard Company (Palo Alto, Calif.) PA-8000
processor.

The following discussion explains compiler operation in
the context of a loop within an application program. Loops
are readily recognized as a sequence of code that is
iteratively executed some number of times. The sequence of
such operations is predictable because the same set of
operations is repeated for each iteration of the loop. It is
common practice in an application program to maintain an index
variable for each loop that is provided with an initial value,
and that is incremented by a constant amount for each loop
iteration until the index variable reaches a final value. The
index variable is often used to address elements of arrays
that correspond to a regular sequence of memory locations.

In the compiler, it has been found that the low level
optimizer component of a compiler is in a good position to
deduce the number of cycles required by a stretch of code
that is repetitively executed and this information can be used
to determine the optimal unroll factor. As discussed above,
the concept of loop unrolling is not new, but use of smart
unrolling is new. For example, FIG. 4 shows the loop of FIG.
3 after the loop has been unrolled four times. Thus, if N
were 100, instead of executing the loop 100 times, the loop is
executed 25 times.

FIG. 5 shows the output code that is generated by the
invention in contrast to the code generated by the prior art,
as shown on FIG. 4. The replicated inner loops 333 are the
same. Also, the compensation loop 323 is the same as the
prior art compensation loop 155 of FIG. 4. However, the
loop tests at 301 and 159 of FIG. 4 now compare I against 5
rather than 4, as can be seen at 311 and 321 of FIG. 5.
The effect is to ensure that the compensation loop is always
executed at least once. This eliminates the need to test for the
zero case, and thus the branch instruction 303 of FIG. 4. As
indicated above, the elimination of this branch instruction
significantly increases the speed of the compiled code by
reducing the number of branch instructions that get
mispredicted.
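
The effect of raising the threshold from the unroll factor to the unroll factor plus one can be modelled in Python. This is a minimal sketch of the FIG. 5 control flow under the assumption that N is at least 1; the function name and the returned iteration counter are additions made for illustration and do not appear in the figures.

```python
def unrolled_with_guaranteed_compensation(b, x, n):
    """Model of FIG. 5: a[i] = b[i] + x for i = n down to 1, unroll factor 4 (n >= 1).

    Returns the result array and the number of compensation iterations executed.
    """
    a = [0] * (n + 1)
    i = n
    while i >= 5:                    # threshold is unroll factor + 1, not 4
        for _ in range(4):           # stands in for four straight-line copies of the body
            a[i] = b[i] + x
            i -= 1
    # Here 1 <= i <= 4, so no "IF (I = 0)" guard is needed before the compensation loop.
    compensation_iterations = 0
    while True:                      # bottom-tested: always runs at least once
        a[i] = b[i] + x
        i -= 1
        compensation_iterations += 1
        if not i > 0:
            break
    return a, compensation_iterations
```

With N = 8 the unrolled loop runs once and the compensation loop handles the remaining four elements; the prior art form of FIG. 4 would have run the unrolled loop twice and then needed the zero test.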

It is also possible to put the compensation code in front of
the main loop, as is shown on FIG. 6. Here the compensation
loop 383 is in front of the repetitive unrolled loops 347. In
the general case, putting the compensation code before the
main unrolled loop is less efficient than putting it afterwards,
because calculating the loop trip count requires a remainder
operation which involves high latency divide operations.
However, if the unroll factor is a power of two, as in this
case where the unroll factor is 4, the remainder calculation
is a simple shift operation. Because unroll factors of 2, 4, or
8 are common, the compensation code can be placed in front
of the unrolled loop for negligible cost. As a practical
matter, it is often advantageous to put the compensation loop
in front of the unrolled loop to benefit from other
optimizations such as register reassociation. When the
compensation loop is placed before the unrolled loop, the
variable that keeps track of the iteration count is not always
needed after the unrolled loop. When the compensation loop is
placed after the unrolled loop, this variable is always needed
after the unrolled loop as there is an exposed use in the
compensation loop. This exposed use can inhibit aggressive
register reassociation. In the preferred embodiment, the
architecture of the computer and the interactions with other
optimizations dictate an unroll factor, and if it is a power of
two, the compensation code is inserted in front of the
unrolled loop.

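
Pre-loop compensation with a power-of-two unroll factor can be sketched as follows. The compensation count ((N-1) MOD 4) + 1 is taken from FIG. 6; it is computed here with a bitwise AND, which is one inexpensive way to obtain a remainder by a power of two. The function names are hypothetical, N is assumed to be at least 1, and, as an assumption of this sketch, the compensation iterations work downward from index N so that the two loops cover disjoint elements.

```python
UNROLL_FACTOR = 4                    # must be a power of two for the mask below


def compensation_count(n: int) -> int:
    """Iterations done by the pre-loop compensation code: ((n - 1) mod 4) + 1."""
    return ((n - 1) & (UNROLL_FACTOR - 1)) + 1


def add_constant_precompensated(b, x, n):
    """a[i] = b[i] + x for i = n down to 1, with compensation code placed first (n >= 1)."""
    a = [0] * (n + 1)
    i = n
    for _ in range(compensation_count(n)):   # compensation loop: always 1 to 4 passes
        a[i] = b[i] + x
        i -= 1
    while i > 0:                     # i is now a multiple of 4, possibly 0
        for _ in range(UNROLL_FACTOR):       # stands in for four straight-line copies
            a[i] = b[i] + x
            i -= 1
    return a
```

Because the remaining count after compensation is always a multiple of four, the unrolled loop needs no leftover handling of its own.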
Another optimization technique that is part of the
invention herein disclosed rearranges loops with early exits
(which are henceforth referred to as WHILE loops). These
loops are characterized by the fact that some of the inner
loop code is done before the loop test and some after the
loop test, as is shown on FIG. 7a. Here the loop has an exit
branch 403 in the middle with inner loop operations 401
before it, and other inner loop operations 405 after it.

The optimization taught in the prior art for this loop is
shown on FIG. 7b. Notice that the whole inner loop,
including the exit instruction, is replicated 377 four times.
This can be improved by converting the unrolled loop into
a FOR loop with an exit condition of (unroll factor + 1) as
opposed to the unroll factor (e.g. 5 instead of 4 in this case), as
is shown on FIG. 8. This guarantees that the unrolled loop
is exited before it would have to exit due to the WHILE
condition. Because none of the branches at 377 on FIG. 7b
would be taken, the WHILE exit instruction can be removed, as
is shown at 319 on FIG. 8. Thus, there is only one place where
there is a WHILE loop exit, i.e. at 365.

The technique herein disclosed ensures that the unrolled
loop exits before it would take any of the WHILE loop exits,
so that the WHILE test can be removed from the unrolled
loop. It is necessary to ensure that the compensation code is
always executed at least once.
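
The transformation can be illustrated with a small Python model. The loop body below (one operation before the exit test and one after it) is a simplified stand-in for the loop of FIG. 7a, not a transcription of it; the function names and the assumption of a non-empty input are hypothetical.

```python
def mid_exit_original(b, x):
    """WHILE loop: work, exit test in the middle, more work. Assumes len(b) >= 1."""
    a, c = [0] * len(b), [0] * len(b)
    i = len(b) - 1
    while True:
        a[i] = b[i] + x      # operations before the exit
        i -= 1
        if i < 0:            # exit branch in the middle of the loop
            break
        c[i] = b[i] - x      # operations after the exit
    return a, c


def mid_exit_unrolled(b, x, factor=4):
    """Same loop, unrolled with a FOR-style guard of (unroll factor + 1)."""
    a, c = [0] * len(b), [0] * len(b)
    i = len(b) - 1
    # Entered only when the counter is at least factor + 1, so none of the
    # middle exits could be taken inside the unrolled body.
    while i >= factor + 1:
        for _ in range(factor):  # stands in for `factor` copies of the body
            a[i] = b[i] + x
            i -= 1
            c[i] = b[i] - x      # middle exit test removed
    # Compensation code: the original WHILE loop, always run at least once,
    # and now the only place where the WHILE exit can be taken.
    while True:
        a[i] = b[i] + x
        i -= 1
        if i < 0:
            break
        c[i] = b[i] - x
    return a, c
```

The guard leaves at least one iteration for the compensation loop, which is what allows the exit test to appear in only one place.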

The following discusses how the unroll factor interacts
with the scalar replacement optimization. This is particularly
important because the form of this optimization determines
the unroll factor. Consider the loop shown on FIG. 9. Notice
that the value of A[I] stored in the inner loop at 367 is loaded
again two loop iterations later, when the same statement
loads A[I-2] with a value of I which is incremented by 2. The
idea behind scalar optimization, which is well known in the
prior art, is to save array values in temporary variables if
they are accessed again within the next few iterations.
Thus, the loop can be modified as is shown on FIG. 10.
Here, the array reference A[I-2] at 391 is replaced with T,
and it is followed by two instructions at 393 and 395 which
assign values to T and T1. Two scalar temporary variables
are necessary because the two most recent array
values must be saved. The value of A[I-1] would have been
stored in the previous iteration in T1, and that is going to be
used in the next iteration. We move T1 to T, and T will be
accessed in the next iteration. Similarly T1, to which A[I] is
assigned, will be moved to T in the next iteration and used
2 iterations from now.

The instruction at 395 appears to make an indexed reference
to the array A, and that suggests that an array access
must be made to get the number to put into T1, which would
be a high latency operation and would lose all that was
gained by the optimization. What actually happens is that the
optimizer recognizes that the value of A[I] is stored two
instructions earlier at 391, and that A[I] is resident in a
register which can be stored into T1 without accessing A[I].
As T1 is likely to be assigned to a register, this operation is
a register to register instruction.
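
As a rough illustration of scalar replacement on the recurrence A[I] = A[I-2] / B[I], the following Python sketch keeps the two most recent values in temporaries. It models the idea only; the function names, the starting index of 2, and the local `v` standing in for the register that holds A[I] are assumptions of this sketch.

```python
def recurrence_plain(a, b):
    """Reference form: every iteration reloads a[i-2] from the array."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] / b[i]
    return a


def recurrence_scalar_replaced(a, b):
    """Scalar-replaced form: a[i-2] and a[i-1] live in t and t1."""
    a = list(a)
    if len(a) < 2:
        return a
    t, t1 = a[0], a[1]       # initialization of the temporaries
    for i in range(2, len(a)):
        v = t / b[i]         # value is held in a "register"
        a[i] = v             # store to the array, no load of a[i-2]
        t = t1               # a[i-1] becomes the a[i-2] of the next iteration
        t1 = v               # register-to-register copy, no reload of a[i]
    return a
```

The two copies at the bottom of the loop are the shuttling of temporaries that unrolling by a suitable factor can later remove.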

The foregoing illustrates how the various optimization
techniques are interrelated, allowing the loop unrolling optimizer
to generate code which is clearly not optimal in itself
but is optimized by other optimizers in the compiler. Initialization
code is inserted at 388 to define initial values of
T and T1.

FIG. 11 shows how such a loop can be unrolled according
to the prior art if an unroll factor of three had been chosen.
The prior art does not specify the selection of an unroll
factor of three here, but a factor of three, or a multiple of
three, is optimal because it allows other parts of the optimizer
to generate the code shown on FIG. 12. The selection of an
unroll factor here of three or a multiple of three is important
because three values of the array must be kept, A[I], A[I-1]
and A[I-2], if the array references are to be avoided.

Three variables T, T1 and T2 are used, making it possible
for other well known code optimization techniques to generate
code such as that shown on FIG. 12. This eliminates
shuttling the temporary data from T to T1. This is an
example of where the nature of the code in the loop and its
effect on scalar optimization forces a particular unroll factor.
In general, one lists the various indexes in the loops and sorts
them to notice the maximum distance between them.
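
A sketch of why a factor of three removes the copies: with three temporaries, each unrolled copy of the body can write into the temporary whose value is no longer needed, so the roles rotate back to their starting positions after three iterations. This Python model is illustrative only and is not a transcription of FIG. 12; the function names and the trailing compensation loop are assumptions.

```python
def recurrence_plain(a, b):
    """Reference form of A[I] = A[I-2] / B[I]."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] / b[i]
    return a


def recurrence_unrolled_by_three(a, b):
    """Unrolled by three with rotating temporaries t, t1, t2 (no copies)."""
    a = list(a)
    n = len(a)
    if n < 2:
        return a
    t, t1 = a[0], a[1]            # t holds a[i-2], t1 holds a[i-1]
    i = 2
    while i + 3 <= n:
        t2 = t / b[i]             # a[i]   computed from a[i-2]
        a[i] = t2
        t = t1 / b[i + 1]         # a[i+1] computed from a[i-1]; old t is dead
        a[i + 1] = t
        t1 = t2 / b[i + 2]        # a[i+2] computed from a[i]; old t1 is dead
        a[i + 2] = t1
        i += 3                    # t and t1 again hold a[i-2] and a[i-1]
    while i < n:                  # compensation loop for leftover iterations
        v = t / b[i]
        a[i] = v
        t, t1 = t1, v
        i += 1
    return a
```

With an unroll factor that is not a multiple of three, the temporaries would not return to their original roles at the loop back edge, and copies between them would be needed again.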

In the above case the distance between the index I and
I-2 is 2. Adding one to this value computes a primary unroll
factor, which is three in this case. This is an acceptable
unroll factor. However, if it turned out to be very small, one
might want to multiply it by a constant to get a larger unroll
factor. Alternatively, if the primary unroll factor was very
large, one might want to divide it by the loop increment if
it was a number other than one. The reasons for selecting a
particular unroll factor are discussed below.
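
The primary unroll factor calculation can be written out as a short sketch. Representing each array reference by its constant offset from the loop index (0 for A[I], -2 for A[I-2]) is an assumption made here for illustration, as is the function name.

```python
def primary_unroll_factor(index_offsets):
    """Maximum distance between the index offsets in the loop, plus one."""
    offsets = sorted(index_offsets)          # sort the indexes
    max_distance = offsets[-1] - offsets[0]  # largest distance between them
    return max_distance + 1


# References A[I] and A[I-2]: distance 2, primary unroll factor 3.
factor = primary_unroll_factor([0, -2])
```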

Determining the unroll factor. 

Classically the prior art uses a standard unroll factor for
all loops. Typically the number used is four. In the invention,
the unroll factor is calculated for each loop depending on
various factors. At one extreme some loops are not unrolled
at all, and other loops are unrolled eight or even more times.
The disadvantages of picking too large an unroll factor are:

1. All the loop instructions may not fit into the instruction
cache, leading to a lot of I-cache misses.

2. The higher the unroll factor, the higher the memory
stride of memory instructions across iterations. If the
memory stride exceeds cache line size, the effectiveness
of prefetch decreases.

3. The resulting code is longer. Usually an upper bound
must be chosen; an unroll count of 1000 is not likely to
be a good idea since the compile time can go up
significantly. Also, excessive unrolling can adversely
affect other optimizations which have bounds on the
number of transformations they can make.

On the other hand, if a small unroll factor is chosen the
following problems can occur:

1. Much more time is spent executing the high latency
branch instruction which closes the loop.

2. The short inner loop provides many fewer opportunities
for optimization than longer inner loops. Where the
inner loop has high latency instructions, the compiler
can often have them execute in parallel with low
latency instructions. This may not be possible in
very short loops.

One must keep in mind that the compiler is compiling
loops that range from single instruction inner loops to loops
that have scores or even hundreds of instructions, and so the
compiler must balance these considerations to
achieve good unroll factors. To determine the unroll factor,
the compiler considers the following in decreasing order of
importance:

1. There is a maximum value of the unroll factor (which
in the preferred embodiment of the invention is eight).

2. The number of instructions in the unrolled loop must
not exceed a specific limit. This provides another upper
bound to the unroll factor.

3. If there are references to previous indexed contents of
the array, such as was shown in FIGS. 10 through 12, an
unroll factor suggested by this analysis (or a multiple of
it) should be used.

4. If prefetch instructions are being generated (this is
known based on a user defined flag), then try to pick an
unroll factor that keeps the value of the strides of array
references within the loop below the cache line size.

5. If the trip count is a constant known at compile time,
then an unroll factor that eliminates the need for a
compensation code loop should be selected. Typically
this would be an unroll factor of 2, 4 or 8, although
other numbers such as 3 or 5 might be possible.

6. If there is profile information, use that. If the profile
information says that the loop iterates on average
k' times, then if k' is smaller than the maximum value of the
unroll factor as dictated by the previous steps, use k';
else use the maximum value of the unroll factor.

7. If there are high latency operations within the loop such
as divide and square root operations, use an unroll
factor that will enhance the maximum overlap of these
instructions. For instance, if the architecture has two
divide units and the loop has a single divide instruction,
the loop should be unrolled an even number of times so
that both the divide units can be kept simultaneously
busy.

The algorithm that computes the unroll factor tries to
compute an optimal and acceptable unroll factor. The cost of
a nonoptimal unroll factor is slower run time code. As
discussed above, the algorithm is sensitive to profile data,
number of instructions in the loop, architecture features like
functional units and cache line size, interactions with other
optimizations and constant trip counts.
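
One way the prioritized considerations might be combined is sketched below. This is not the algorithm of the invention, which is described only as a list of considerations; it is a hypothetical Python model in which each lower-priority consideration may narrow the set of candidate factors but may never override a higher-priority one. All parameter names, the default instruction limit of 256, and the omission of the prefetch stride consideration are assumptions of the sketch.

```python
def choose_unroll_factor(body_instructions, max_factor=8,
                         max_unrolled_instructions=256,
                         recurrence_factor=None, trip_count=None,
                         avg_trip_count=None, divide_units=1):
    """Pick an unroll factor by narrowing candidates in priority order."""

    def narrow(candidates, keep):
        # Apply a preference only if at least one candidate satisfies it.
        kept = [f for f in candidates if keep(f)]
        return kept or candidates

    # Considerations 1 and 2: hard upper bounds on the factor.
    limit = min(max_factor,
                max(1, max_unrolled_instructions // body_instructions))
    candidates = list(range(1, limit + 1))

    # Consideration 3: factor suggested by array recurrences, or a multiple.
    if recurrence_factor:
        candidates = narrow(candidates, lambda f: f % recurrence_factor == 0)
    # Consideration 5: constant trip count, avoid compensation code.
    if trip_count:
        candidates = narrow(candidates, lambda f: trip_count % f == 0)
    # Consideration 6: do not unroll past the profiled average trip count.
    if avg_trip_count:
        candidates = narrow(candidates, lambda f: f <= avg_trip_count)
    # Consideration 7: keep all divide units busy.
    if divide_units > 1:
        candidates = narrow(candidates, lambda f: f % divide_units == 0)

    return max(candidates)
```

For example, under these assumptions a ten-instruction loop body with a recurrence factor of three yields a factor of six, the largest multiple of three not exceeding the maximum of eight.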

Attention is directed to FIG. 13, which shows how the
optimization algorithms presented here are implemented. At
201 the unroll factor is determined as described above. Next,
at 203, a check is made for the special case where the trip
count is known at compile time and is a multiple of the
unroll factor. In this case, the unrolled loop code is generated
from the original loop code, any middle exits are removed
leaving only the final exit, and the unrolled code is output at
242. Because this is an unrolled loop which needs no
compensation code, none is output. The other exit occurs at
203 where the trip count is not known at compile time, or the
trip counts and unroll factor are such that compensation code
must be generated. Control goes to 205 where the unrolled
code is generated. For non-WHILE loops, the middle exits
are removed leaving only the final exit. For a WHILE loop,
the final exit is moved to the end of the loop (226) instead
of the middle and all other exits removed. At this time a
determination (at 229) is made using the unroll factor to
determine if the compensation code should be output before
or after the unrolled loop code. If it should be after, control
goes to 244, otherwise control goes to 246. All of these three
control paths then meet at 999, terminating the unrolling
optimization.

FIG. 14 is a block schematic diagram of a compiler for a
programmable machine in accordance with the invention.
The compiler of FIG. 2b shows a loop unrolling module 30.
The preferred embodiment of the invention provides a loop
unrolling module that is placed within the compiler as
shown in FIG. 2b. As shown in FIG. 14, the compiler
comprises an analysis module 1301 for analyzing and
unrolling loops within source applications. An optimizer
module 1302 determines an optimum unroll factor in
response to the analysis module. An unroll module 1303
generates an unrolled loop having said optimum unroll
factor, while a compensation module 1304 generates and
places any compensation code as required as a result of loop
unroll optimization.

Although the invention is described herein with reference
to the preferred embodiment, one skilled in the art will
readily appreciate that other applications may be substituted
for those set forth herein without departing from the spirit
and scope of the present invention. Accordingly, the invention
should only be limited by the claims included below.

We claim:

1. In a programmable machine, a compiler comprising:
an analysis module for analyzing and unrolling loops
within source applications;
an optimizer module for determining an optimum unroll
factor in response to said analysis module;
an unroll module for generating an unrolled loop having
said optimum unroll factor; and
a compensation module for generating and placing any
compensation code as required as a result of loop unroll
optimization,
wherein said compensation module performs a placement
calculation to determine whether to put said compensation
code before the unrolled loop or after the
unrolled loop.

2. The compiler of claim 1, wherein said compensation
module ensures that said compensation code is executed at
least once when said unrolled loop is executed.

3. The compiler of claim 1, wherein said unroll factor is
responsive to a number of instructions in the loop.

4. The compiler of claim 1, wherein said unroll factor is
responsive to resource usage within the loop.

5. The compiler of claim 1, wherein said unroll factor is
responsive to prefetch distance of memory references within
the loop.

6. The compiler of claim 1, wherein said unroll factor is
responsive to profile information collected from previous
executions of compiled code.

7. The compiler of claim 1, wherein said unroll factor is
responsive to recurrence of memory references within the
loop.

8. The compiler of claim 1, wherein said unroll factor is
responsive to [illegible] instructions in the loop.

9. The compiler of claim 1, wherein a trip count for a
loop is known at compile time, and said optimizer
module determines an unroll factor that executes the
compensation code zero times and suppresses the generation of
said compensation code.

10. The compiler of claim 1, wherein said placement
calculation is responsive to the unroll factor computed for
the loop.

11. The compiler of claim 1, wherein said loop is a loop
having an early exit.

12. The compiler of claim 1, wherein loop unrolling is
integrated with other low-level optimization phases.

13. The compiler of claim 12, wherein said other low-level
optimization phases include any of prefetch instruction
insertion, register reassociation, and instruction scheduling.

14. A method for unrolling loops, comprising the steps of:
determining an unroll factor;
generating an unrolled loop which always exits leaving a
remaining trip count of at least one;
generating compensation code;
determining whether the compensation code should be
placed before the unrolled loop or after the unrolled
loop; and
determining if the trip count is a power of two; and if it
is, placing the compensation code before the unrolled
loop.

15. The method of claim 14, wherein the loop to be
unrolled is a loop having an early exit.

16. The method of claim 15, wherein the loop having an
early exit after unrolling is transformed into a loop having an
exit at its end, and from which all intermediate exits have
been removed.

* * * * *